Logistic Regression Model
- Overview: Logistic regression is used when the dependent variable (target) is categorical, as in this case.
Types of Logistic Regression
- Binary Logistic Regression: the categorical response has only two possible outcomes. Example: spam or not spam.
- Multinomial Logistic Regression: three or more categories without ordering. Example: predicting which food is preferred (Veg, Non-Veg, Vegan).
- Ordinal Logistic Regression: three or more categories with ordering. Example: a movie rating from 1 to 5.
The p-value is an important evaluation metric: it helps you decide whether there is a relationship between two variables.
The smaller the p-value, the more confident you can be that a relationship between the two variables exists. P-values originate from hypothesis testing in statistics, where you have two hypotheses: H0 (the null hypothesis): there is no relationship between the two variables. H1 (the alternative hypothesis): there is a relationship between the two variables.
If the p-value is less than a small threshold (0.05 is often used), you can reject the null hypothesis H0, i.e., you conclude that there is a relationship between the two variables.
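As a toy illustration of this decision rule, here is a sketch in base R using simulated data (`cor.test` is just one example of a procedure that produces a p-value; the variables here are hypothetical):

```r
# Toy illustration of the H0 / H1 decision with base R's correlation test.
# The data is simulated so that a real relationship exists between x and y.
set.seed(7)
x <- rnorm(100)
y <- 0.5 * x + rnorm(100)   # y genuinely depends on x

test <- cor.test(x, y)      # H0: no correlation between x and y
test$p.value < 0.05         # TRUE here: reject H0, conclude a relationship
```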
- Run the model against the training data using the binomial family
Looking at the p-values, the following variables are significant predictors of the target variable:
- Gender(sex)
- Type of Chest Pain (cp)
- Resting blood pressure (trestbps)
- Exercise induced Angina (exang)
- Stress Test depression (oldpeak)
- Number of colored vessels (ca)
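A sketch of this binomial fit in R (the report's `trainData` is not included here, so the data is simulated with the same sign pattern as the reported coefficients, and only a subset of the variables is used):

```r
# Hypothetical stand-in for trainData; column names follow the report.
set.seed(1)
n <- 228
trainData <- data.frame(
  sex     = rbinom(n, 1, 0.6),
  cp      = sample(0:3, n, replace = TRUE),
  oldpeak = rnorm(n),
  ca      = sample(0:4, n, replace = TRUE)
)
lp <- 0.7 - 1.8 * trainData$sex + 0.66 * trainData$cp -
      0.75 * trainData$oldpeak - 0.98 * trainData$ca
trainData$target <- rbinom(n, 1, plogis(lp))

# Binomial (logistic) fit, as in the report
fit <- glm(target ~ sex + cp + oldpeak + ca, family = binomial, data = trainData)
summary(fit)   # coefficient table with z values and p-values Pr(>|z|)
```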
- Evaluate and refine the model
- Compute the average predicted probability for each true outcome: for patients with heart disease (target = 1) the average predicted probability is 0.80, while for patients without heart disease (target = 0) it is 0.25, so the model separates the two classes well on the training data
##
## Call:
## glm(formula = target ~ age + sex + cp + trestbps + chol + fbs +
## restecg + thalach + exang + oldpeak + slope + ca, family = binomial,
## data = trainData)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.4422 -0.3588 0.1597 0.4854 2.3002
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 0.6865 0.8106 0.847 0.397009
## age -0.4887 0.2563 -1.907 0.056579 .
## sex -1.7608 0.5116 -3.442 0.000578 ***
## cp 0.6577 0.2180 3.018 0.002548 **
## trestbps -0.4288 0.2120 -2.023 0.043058 *
## chol -0.4255 0.2445 -1.740 0.081816 .
## fbs 0.3893 0.6726 0.579 0.562792
## restecg 0.6065 0.4000 1.516 0.129446
## thalach 0.2833 0.2574 1.101 0.271035
## exang -1.1019 0.4682 -2.353 0.018608 *
## oldpeak -0.7465 0.2699 -2.766 0.005682 **
## slope 0.6164 0.4415 1.396 0.162670
## ca -0.9816 0.2314 -4.242 2.22e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 313.54 on 227 degrees of freedom
## Residual deviance: 158.16 on 215 degrees of freedom
## AIC: 184.16
##
## Number of Fisher Scoring iterations: 6
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.001613 0.156975 0.637273 0.552632 0.922450 0.999223
## 0 1
## 0.2491418 0.7983138
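The two class-conditional averages above can be computed with `tapply`; a sketch on simulated predictions, since the report's fitted probabilities are not available here:

```r
# Mean predicted probability per true class, as in the output above.
set.seed(3)
target <- rbinom(200, 1, 0.5)
# Simulated fitted probabilities: higher on average for true cases
prob <- plogis(rnorm(200, mean = ifelse(target == 1, 1.2, -1.0)))
tapply(prob, target, mean)   # average prediction for class 0 and class 1
```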
- Find the threshold probability to delineate between heart disease = 0 and 1
- The ROC curve helps find this threshold
- We want high TRUE POSITIVES (sensitivity) for diagnosing heart disease, and are OK with a higher false-positive rate
- Therefore we pick the point (0.9, 0.2) on the ROC curve, where 90% of patients with heart disease are diagnosed correctly
- At this point the threshold is 0.5, which implies that a predicted probability above 0.5 should be classified as heart disease
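The report uses pROC's ROC curve for this; the same threshold scan can be sketched in base R (simulated probabilities on hypothetical data):

```r
# Scan candidate thresholds and record sensitivity vs. false-positive rate.
set.seed(5)
target <- rbinom(300, 1, 0.5)
prob <- plogis(rnorm(300, mean = ifelse(target == 1, 1, -1)))

thresholds <- seq(0.1, 0.9, by = 0.1)
roc_points <- t(sapply(thresholds, function(th) {
  pred <- as.integer(prob > th)
  c(threshold   = th,
    sensitivity = mean(pred[target == 1] == 1),  # true positive rate
    fpr         = mean(pred[target == 0] == 1))  # false positive rate
}))
roc_points   # pick a threshold with high sensitivity at an acceptable FPR
```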

- Evaluate model by predicting on test dataset and generating the ROC curve
- With 0.5 as the threshold, prediction on the test dataset gives an accuracy of 73% (55 of 75 correct), with an area under the ROC curve of 0.85
The accuracy of the test depends on how well the test separates the group being tested into those with and without the disease in question. Accuracy is measured by the area under the ROC curve. An area of 1 represents a perfect test; an area of .5 represents a worthless test. A rough guide for classifying the accuracy of a diagnostic test is the traditional academic point system:
- 0.90–1.00 = excellent (A)
- 0.80–0.90 = good (B)
- 0.70–0.80 = fair (C)
- 0.60–0.70 = poor (D)
- 0.50–0.60 = fail (F)
##
## FALSE TRUE
## 0 24 12
## 1 8 31
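Accuracy can be read off this table directly; the counts below are copied from the output above (rows are actual target, columns the predicted class):

```r
# Confusion matrix from the output: rows = actual target, columns = predicted.
conf <- matrix(c(24, 8, 12, 31), nrow = 2,
               dimnames = list(actual = c("0", "1"),
                               predicted = c("FALSE", "TRUE")))
accuracy <- sum(diag(conf)) / sum(conf)   # (24 + 31) / 75
round(accuracy, 3)                        # 0.733
```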

##
## Call:
## roc.default(response = testData$target, predictor = lrtest.prob, plot = TRUE, col = "blue")
##
## Data: lrtest.prob in 36 controls (testData$target 0) < 39 cases (testData$target 1).
## Area under the curve: 0.8533
- Plot the logistic regression model against predictor variables
- Gender(sex)
- Type of Chest Pain (cp)
- Resting ECG result (restecg)
- Exercise induced Angina (exang)
- Stress Test depression (oldpeak)
- Number of colored vessels (ca)
- Gender versus Target variable
- In this dataset, heart disease is more prevalent in females than in males

- Type of Chest Pain versus Target variable
- Heart disease is predicted by types 1,2 and 3 of chest pain

- Resting ECG result versus Target variable
A higher resting ECG result predicts heart disease

- Exercise Induced Angina versus Target variable
The absence of exercise-induced angina predicts heart disease (the exang coefficient is negative)

- Stress Test Depression versus Target variable
Lower depression during the stress test predicts heart disease

- Number of Colored Vessels versus Target variable
- Heart disease is predicted by lower number of colored vessels
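A sketch of how one of these plots can be produced in base R; the data is a simulated stand-in (the direction of the gender effect mirrors the report):

```r
# Proportion of heart disease by gender, as a grouped bar plot.
set.seed(9)
sex <- rbinom(150, 1, 0.6)                              # 0 = female, 1 = male
target <- rbinom(150, 1, ifelse(sex == 1, 0.4, 0.75))   # disease rarer in males here
tab <- prop.table(table(sex, target), margin = 1)
barplot(t(tab), beside = TRUE,
        names.arg = c("female (0)", "male (1)"),
        legend.text = c("no disease", "disease"),
        ylab = "proportion")
```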


Decision Tree algorithm
Overview: A decision tree is a flowchart-like tree structure in which each internal node represents a feature (or attribute), each branch represents a decision rule, and each leaf node represents an outcome. The topmost node is the root node. The tree learns to partition the data on the basis of attribute values, splitting recursively in a process called recursive partitioning. This flowchart-like structure helps with decision making: its visualization resembles a flowchart and closely mimics human-level reasoning, which is why decision trees are easy to understand and interpret.
Run the model against training data, predict target on test data
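A sketch with the rpart package (the report's heart data isn't included here, so the built-in iris dataset stands in):

```r
# Recursive partitioning with rpart; iris is a stand-in for the heart dataset.
library(rpart)
set.seed(2)
idx   <- sample(nrow(iris), 0.7 * nrow(iris))
train <- iris[idx, ]
test  <- iris[-idx, ]

fit  <- rpart(Species ~ ., data = train, method = "class")
pred <- predict(fit, test, type = "class")
mean(pred == test$Species)   # test-set accuracy
```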
Evaluate the model:
The ROC curve shows an AUC of 0.72 for this model


##
## Call:
## roc.default(response = testData$target, predictor = factor(dt.predict, ordered = TRUE), plot = TRUE, col = "blue")
##
## Data: factor(dt.predict, ordered = TRUE) in 36 controls (testData$target 0) < 39 cases (testData$target 1).
## Area under the curve: 0.7179
Artificial Neural Networks (ANN) Model
Overview: An artificial neural network is a set of connected units (neurons) organized in layers; each connection carries a weight that is adjusted during training so that the network learns a mapping from the input features to the target.
Run the model against training data, predict target on test data
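A sketch with the nnet package (a single-hidden-layer network; iris stands in for the heart data, which isn't included here):

```r
# Single-hidden-layer neural network via nnet; iris as stand-in data.
library(nnet)
set.seed(42)
fit  <- nnet(Species ~ ., data = iris, size = 4, maxit = 500,
             decay = 5e-4, trace = FALSE)
pred <- predict(fit, iris, type = "class")
mean(pred == iris$Species)   # training accuracy
```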
Evaluate the model:
The ROC curve shows an AUC of 0.80 for this model

##
## Call:
## roc.default(response = testData$target, predictor = factor(annResult, ordered = TRUE), plot = TRUE, col = "blue")
##
## Data: factor(annResult, ordered = TRUE) in 36 controls (testData$target 0) < 39 cases (testData$target 1).
## Area under the curve: 0.7959
Gradient Boosting Model
- Overview: Gradient boosting is an ensemble method that combines many weak learners, typically shallow decision trees, into a single strong model.
Gradient Boosting Machine (for regression and classification) is a forward, stage-wise ensemble method. The guiding heuristic is that good predictive results can be obtained through increasingly refined approximations. H2O's GBM sequentially builds regression trees on all the features of the dataset in a fully distributed way: each tree is built in parallel.
Run the model against training data, predict target on test data
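The "sequentially refine the approximation" idea can be sketched in base R with stump learners and squared-error loss (toy simulated data; real GBMs such as H2O's add many refinements like subsampling and tree depth control):

```r
# Minimal gradient boosting sketch: fit stumps to residuals, one per round.
set.seed(1)
x <- runif(200)
y <- sin(2 * pi * x) + rnorm(200, sd = 0.2)   # toy regression target

# A "stump": split x at the candidate point that best reduces squared error
fit_stump <- function(x, r) {
  splits <- quantile(x, probs = seq(0.1, 0.9, by = 0.1))
  best <- NULL; best_sse <- Inf
  for (s in splits) {
    left <- r[x <= s]; right <- r[x > s]
    sse <- sum((left - mean(left))^2) + sum((right - mean(right))^2)
    if (sse < best_sse) {
      best_sse <- sse
      best <- list(split = s, left = mean(left), right = mean(right))
    }
  }
  best
}
predict_stump <- function(m, x) ifelse(x <= m$split, m$left, m$right)

# Boosting loop: each stump is fit to the residuals of the current ensemble
boost <- function(x, y, n_trees = 100, lr = 0.1) {
  pred <- rep(mean(y), length(y))
  models <- vector("list", n_trees)
  for (i in seq_len(n_trees)) {
    m <- fit_stump(x, y - pred)            # fit weak learner to residuals
    pred <- pred + lr * predict_stump(m, x)
    models[[i]] <- m
  }
  list(init = mean(y), models = models, lr = lr)
}

gbm_fit <- boost(x, y)
final <- gbm_fit$init +
  gbm_fit$lr * Reduce(`+`, lapply(gbm_fit$models, predict_stump, x = x))
mse <- mean((y - final)^2)   # much lower than the baseline variance of y
```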
Evaluate the model:
The ROC curve shows an AUC of 0.80 for this model

##
## Call:
## roc.default(response = testData$target, predictor = gbm.test, plot = TRUE, col = "red")
##
## Data: gbm.test in 36 controls (testData$target 0) < 39 cases (testData$target 1).
## Area under the curve: 0.797